Add thunder benchmarks #3394
base: main
Conversation
!build
!test --pybench
Can you check how the benchmark looks? Make sure there are no unexpected results.
!test --pybench-full --dev
see results here when the pipeline finishes
!test --pybench-full --dev
!test --pybench-full --dev
@@ -325,6 +327,9 @@ def run_benchmark(
     def setup():
         clear_l2_cache()
         if device == "cuda":
+            for inp in inputs:
+                if isinstance(inp, torch.Tensor):
+                    inp.grad = None
Thank you for this one. But this only covers the case where an input requires gradients. Are we also clearing gradients on parameters?
Do you mean, for instance, weights in layernorm? Then, yes.
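For concreteness, a toy illustration of the accumulation problem (not the benchmark code; the layernorm example is just an assumption): if .grad is not reset between timed iterations, autograd accumulates into the existing buffers, which changes both the values and the work done in the backward pass.

import torch

ln = torch.nn.LayerNorm(8)
x = torch.randn(4, 8, requires_grad=True)

for _ in range(2):
    ln(x).sum().backward()          # second call accumulates into existing .grad buffers

print(ln.weight.grad.abs().sum())   # holds gradients from both iterations

# what the benchmark setup should do between iterations:
ln.weight.grad = None
ln.bias.grad = None
x.grad = None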
I'm curious how this works in code.
If benchmark_fn is not a function but a torch module, the thunder program doesn't expect parameters to be among its inputs; I think they're stored in the thunder-compiled object. So I'm not sure how that's handled.
i.e. something like this:
foo = torch.nn.Linear(4, 5).cuda()
inp = torch.randn(8, 4, device="cuda")
benchmark_fn = with_executor(foo, "thunder")
# ...
run_benchmark(benchmark, unary_bwd_torch, [output, grad], ...)
Ahh, you're right.
Even with clearing the gradients of weights, bias, and inputs in the backward pass, I think it is still missing some variables/internal state that needs to be reset.
The simplest way is to only run one round for backward, but I feel that may be noisy, so I have been trying to make it work for multiple runs.
Gotcha, no worries. I'm not totally clear on thunder's protocol for ownership of parameters; I think it's supposed to be a functional compilation.
So we can still expect that with_executor has a chance to extract parameters from the nn.Module when one is given for the benchmark, and we should be able to identify the parameters that need zero_grad, just like an optimizer would.
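A minimal sketch of what that could look like, assuming the benchmark keeps a handle on the original nn.Module (the helper name clear_param_grads is made up here):

import torch

def clear_param_grads(module: torch.nn.Module) -> None:
    # mirror optimizer.zero_grad(set_to_none=True): drop accumulated
    # gradients on every parameter before the next timed iteration
    for p in module.parameters():
        p.grad = None

foo = torch.nn.Linear(4, 5).cuda()
# setup() could then call clear_param_grads(foo) in addition to
# clearing the .grad of the input tensors.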
BTW, this could also contribute to a potential performance diff:
if there are parameters requiring grad, thunder will generate a backward graph and save intermediates, regardless of whether backward is actually called.
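For a forward-only measurement, one way to avoid that (just a sketch, under the assumption that thunder.jit is called on the module directly, which is not necessarily what with_executor does) is to freeze the parameters before compiling:

import torch
import thunder

foo = torch.nn.Linear(4, 5).cuda()
foo.requires_grad_(False)               # no parameter needs grad
inp = torch.randn(8, 4, device="cuda")  # input does not require grad either

jfoo = thunder.jit(foo)
out = jfoo(inp)   # forward-only: no backward graph, no saved intermediates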
!test --pybench-full --dev
We still see some performance differences between Thunder and nvfuser manual definitions for some operators, which may be due to slight differences in the generated fusion definitions. I will look into this operator by operator. I have compared the measured timings against nsys timings for some operators to verify the accuracy of the benchmark infra. My recommendation is to check in these changes adding Thunder benchmarks while we investigate the performance gap.
I think there are still quite a few open questions about next steps and how to automate things like the IOBytes computation.
But for this PR, things look pretty mechanical, so I'm good with merging it as-is and then ironing out the remaining issues.
My only concern is: would the disruption make our benchmarks unreliable until we fix all these issues, and does that cause problems for folks looking at the benchmarks?
@@ -115,6 +121,6 @@ def test_softmax_bwd_baseline_benchmark(
     run_benchmark(
         benchmark,
         unary_bwd_torch,
-        [outputs, grads],
+        [outputs, grads, *fwd_inputs],
Sneaky! I see unary_bwd_torch discards the extra inputs, but you pass them here so setup() can clear their grads?
We could add a comment in unary_bwd_torch explaining why we are not asserting the number of inputs.
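Something along these lines, assuming unary_bwd_torch simply replays backward through the first input with the second as the incoming gradient (its exact body is a guess on my part):

def unary_bwd_torch(inputs):
    # inputs = [output, grad_output, *extra]. The extra entries are only
    # passed so that setup() can reset their .grad between iterations;
    # they are intentionally ignored here, hence no assert on len(inputs).
    inputs[0].backward(inputs[1], retain_graph=True)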
So by performance gap, I mean the latter case. The measurements of the benchmarks should be accurate.
Agreed! Let me open an issue tracking the IOBytes case. I will create an issue for the differences between
!build
PR Reviewer Guide 🔍 (Review updated until commit 083d89d)
!test --pybench-full --dev
@jacobhinkle @liqiangxl pinging for review. We continue to see a performance difference between Thunder-nvfuser and nvfuser. The thunder-nvfuser executor will not run by default, so we can continue investigating this in parallel. This PR has the fix for the bwd grad accumulation issue, so we can merge it. If the consensus is to hold off on adding Thunder-nvfuser benchmarks altogether, I can remove it from the executors list for benchmarks as well; that change is minimal. Latest benchmark run: http://nv/euP
@jjsjann123 Can you give me the numbers you expect for RoPE bwd after the grad accumulation issue is resolved, so I can verify them with this PR?
@@ -23,6 +22,8 @@
 L2_CACHE_SIZE = DEVICE_PROPERTIES["gpu_l2_bytes"]
 PEAK_BANDWIDTH_GBPS = DEVICE_PROPERTIES["gpu_peak_bandwidth_gbps"]
 
+
+DEFAULT_EXECUTORS = ["eager", "torchcompile", "thunder"]
Maybe this should be named differently, since these are not run nightly; for most benchmarks, this is the set of executors we run weekly. We also have thunder-torchcompile for RoPE.
Maybe BASELINE_EXECUTORS is better, although Thunder is not really a baseline.
!build
Adds thunder as an additional executor to the baseline benchmarks and the corresponding thunder.jit function.
The following benchmarks do not have a thunder benchmark:
- instancenorm: Unsupported operator in Thunder
- test_gelu_backward_reduction.py: .backward call is not supported within Thunder definitions. @IvanYashchuk has suggested using explicit backward computation for this case (a rough sketch of what that could look like is below).
Issue #2718
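For reference, the explicit backward computation could look roughly like this for the exact (erf-based) GELU; the function name and the erf-vs-tanh choice are assumptions here, not necessarily what the benchmark will use.

import math
import torch

def gelu_bwd_explicit(x: torch.Tensor, grad_out: torch.Tensor) -> torch.Tensor:
    # gelu(x) = x * Phi(x), so d/dx gelu(x) = Phi(x) + x * phi(x),
    # where Phi/phi are the standard normal CDF/PDF
    cdf = 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))
    pdf = torch.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return grad_out * (cdf + x * pdf)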